Classifying Documents Without Labels

Authors

  • Daniel Barbará
  • Carlotta Domeniconi
  • Ning Kang
Abstract

Automatic classification of documents is an important area of research with many applications in document searching, forensics, and other fields. Methods for classifying text rely on the existence of a sample of documents whose class labels are known. However, in many situations, obtaining this sample may not be easy, or even possible. Consider, for instance, a set of documents returned as the result of a query. If we want to separate the documents that are truly relevant to the query from those that are not, it is unlikely that we will have labelled documents at hand to train classification models for this task. In this paper we focus on the classification of an unlabelled set of documents into two classes, relevant and irrelevant, given a topic of interest. By dividing the set of documents into buckets (for instance, answers returned by different search engines) and using association rule mining to find common sets of words among the buckets, we can efficiently obtain a sample of documents that contains a large percentage of relevant ones, that is, a sample with high “purity”. This sample can be used to train models to classify the entire set of documents. We try several classification methods to separate the documents, including two-class SVM, for which we develop a heuristic to identify a small sample of negative examples. We show experimentally that our method can accurately classify a set of documents into relevant and irrelevant classes.
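
The bucket idea in the abstract can be illustrated with a small sketch. This is not the authors' implementation: it replaces full association rule mining with a simple count of co-occurring word pairs, and the function names (`frequent_word_sets`, `select_sample`), the `min_doc_support` threshold, and the toy documents are all assumptions made for illustration. It keeps the word sets that are frequent in every bucket and then collects the documents containing at least one of them as a candidate high-purity sample.

```python
from collections import Counter
from itertools import combinations


def frequent_word_sets(buckets, set_size=2, min_doc_support=0.5):
    """Return the word sets that are frequent in every bucket.

    A word set is frequent in a bucket when it appears in at least
    min_doc_support * len(bucket) of that bucket's documents.
    (Hypothetical stand-in for association rule mining.)
    """
    common = None
    for docs in buckets:
        counts = Counter()
        for doc in docs:
            # Sort the distinct words so each set has one canonical tuple.
            words = sorted(set(doc.lower().split()))
            counts.update(combinations(words, set_size))
        threshold = min_doc_support * len(docs)
        frequent = {ws for ws, c in counts.items() if c >= threshold}
        # Keep only the word sets shared by all buckets seen so far.
        common = frequent if common is None else common & frequent
    return common or set()


def select_sample(buckets, word_sets):
    """Collect documents that contain at least one of the shared word sets."""
    sample = []
    for docs in buckets:
        for doc in docs:
            words = set(doc.lower().split())
            if any(words.issuperset(ws) for ws in word_sets):
                sample.append(doc)
    return sample


# Toy data (assumed): two "search engines" answering a query about the animal.
buckets = [
    ["jaguar cat habitat rainforest",
     "jaguar cat prey hunting",
     "jaguar car dealer price"],
    ["jaguar cat conservation",
     "big cat jaguar spots",
     "jaguar car engine review"],
]

shared = frequent_word_sets(buckets)
sample = select_sample(buckets, shared)
print(shared)
print(sample)
```

On this toy data the only word pair frequent in both buckets is ("cat", "jaguar"), so the selected sample holds the four animal-related documents and excludes the two about cars. Such a sample could then serve as the positive training set for a classifier, as the abstract describes.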

Related resources

Multilabel Classification of Documents with Mapreduce

Multilabel classification is the problem of assigning a set of positive labels to an instance and recently it is highly required in applications like protein function classification, music categorization, gene classification and document classification for easy identification and retrieving of information. Labeling the documents of the web manually is a time consuming and a difficult task due t...

Toward an Information Theoretic Approach to Managing Multiple Decision Makers

Citizen science and human computation involves working with multiple, untrusted decision makers. We demonstrate how Bayesian Classifier Combination outperforms a naive Bayes method when classifying documents using unreliable crowdsourced labels. We also present methods for screening workers and selecting informative documents to label. Finally, we explain how the Bayesian Classifier Combination...

Using Fuzzy LR Numbers in Bayesian Text Classifier for Classifying Persian Text Documents

Text Classification is an important research field in information retrieval and text mining. The main task in text classification is to assign text documents in predefined categories based on documents’ contents and labeled-training samples. Since word detection is a difficult and time consuming task in Persian language, Bayesian text classifier is an appropriate approach to deal with different...

Cross-Lingual Dataless Classification for Many Languages

Dataless text classification [Chang et al., 2008] is a classification paradigm which maps documents into a given label space without requiring any annotated training data. This paper explores a crosslingual variant of this paradigm, where documents in multiple languages are classified into an English label space. We use CLESA (cross-lingual explicit semantic analysis) to embed both foreign lang...

Reducing the Dimensionality of Bag-of-Words Text Representation Used by Learning Algorithms

The attribute-value representation of documents used in Text Mining provides a natural framework for classifying or clustering documents based on their content. Supervised learning algorithms can be applied whenever the documents have labels preassigned or unsupervised learning for unlabeled documents. The attribute-value representation of documents is characterized by very high dimensional dat...

Publication date: 2004